The demand for credit is increasing constantly. Banks are looking for
various methods of credit evaluation that provide the most accurate results in
a shorter period in order to minimize their rising risks. This study focuses on
various methods that enable banks to increase their asset quality without
market loss in the credit allocation process. These methods enable
the automatic evaluation of loan applications in line with sector
practices, and enable the determination of credit policies/strategies based on
actual needs. Within the scope of this study, the relationship between the
predetermined attributes and the credit limit outputs is analyzed by using a
sample data set of consumer loans. Random forest (RF), sequential minimal
optimization (SMO), PART, decision table (DT), J48, multilayer perceptron
(MP), JRip, naive Bayes (NB), one rule (OneR) and zero rule (ZeroR)
algorithms were used in this process. As a result of this analysis, SMO,
PART and random forest algorithms are the top three approaches for
determining customer credit limits.

1. INTRODUCTION

Data analysis and data mining techniques are used in many domains to convert raw data into
knowledge [1]. These methods are applied in various fields such as medical diagnosis [2], travel
recommenders [3]-[5], fraud detectors [6], numerous classification problems [7] and many more. The banking
domain is one of these fields, with a wide range of data analysis needs [8]. The credit assessment process is an
important data mining application area in this domain.

One of the main activities of a bank is lending. In order to improve market share and increase
sales profitability, banks need to develop a sound credit assessment process on which they can base their
lending [9]. This is an important requirement for banks due to the constantly evolving market and increasing
competition. The credit assessment process mainly aims to determine whether the party requesting the loan
has the ability to repay it by the end of the loan agreement, and to reduce the likelihood of non-payment
as much as possible by determining whether the party is willing to repay the loan [10].

Developing a unique credit assessment system which produces accurate, stable and reliable results 
in a short time is an important advantage in competitive market conditions. Additionally, such a unique credit 
assessment system, which should be equipped with safe techniques, should help the bank achieve its strategic 
goals in the corporate field and create user satisfaction by meeting the expectations of the customers. For this 
reason, many banks use data mining methods to extract meaningful information from customer data and try
to manage the loan evaluation process by taking this information as a core reference [11]-[17]. 

In this study, the characteristics of people requesting credit were evaluated and the effects of these
features on the available credit limit were examined. The main purpose of this study is to propose a
decision-making approach to minimize the risk of non-repayment of loans. In order to achieve this goal, a
sample data set containing attributes related to credit allocation is used. The relationship between the data set,
which was prepared for use with data mining methods, and the credit limit was examined with random forest
(RF), sequential minimal optimization (SMO), PART, decision table (DT), J48, multilayer perceptron (MP),
JRip, naive Bayes (NB), one rule (OneR) and zero rule (ZeroR) algorithms. The metrics obtained from the
resulting prediction models were compared and analyzed. The models with the most accurate prediction
results were proposed as possible decision-making approaches in the credit allocation process. The rest of the
paper is organized as follows: Section 2 describes the data gathering process and gives a background of the
data mining algorithms used in this research. Section 3 presents the obtained results and Section 4 contains
the conclusion and future plans of this study.


2. RESEARCH METHOD 

A total of 10 different data mining algorithms were used in this research to find the best approach 
for determining customer credit limits. The data set and applied data mining algorithms are described in 
detail in the following subsections. 


2.1. Data gathering and processing 

The data set reflecting the characteristics of possible customer credit requests and credit limits was
generated by a banking specialist. A total of 401 records were collected to run data experiments. The data set
has 14 input attributes derived from 4 main attribute groups and one output attribute, which is the corresponding
credit limit. Table 1 lists the attribute groups, attributes and their descriptions. The credit limit attribute was
discretized into 7 categories based on its data range. Table 2 lists the credit limit categories by minimum and
maximum credit values.


Table 1. Data set attributes 


Attribute groups             Definition                                   Attributes
Monthly income-installment   Customer’s ratio of monthly income over      - Up to 50%
ratio (MIIR)                 installment.                                 - Between 50% and 60%
                                                                          - Between 60% and 70%
                                                                          - Between 70% and 80%
Income type                  Customer’s income type.                      - Private sector paid worker
                                                                          - Private sector unpaid worker
                                                                          - Public sector paid worker
                                                                          - Public sector unpaid worker
Investigation result level   Customer’s status about previous credit      - Reject (Bad reputation according to past payment history)
                             payments.                                    - Unclear
                                                                          - Accept (Good reputation according to past payment history)
Risk level                   Customer’s risk level based on               - High
                             investigation result level and previous      - Medium
                             banking history.                             - Low
Credit limit                 Customer’s credit limit based on Monthly     - Credit Limit
                             Income-Installment Ratio, Income Type,
                             Investigation Result Level and Risk Level
                             attributes.


Table 2. Credit limit categories 


Credit limit category   Minimum credit value (in TL)   Maximum credit value (in TL)
Group 1                 0                              8200
Group 2                 8201                           16400
Group 3                 16401                          24600
Group 4                 24601                          32800
Group 5                 32801                          41000
Group 6                 41001                          49200
Group 7                 49201                          82000
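
The discretization in Table 2 can be illustrated with a short sketch. The function name `credit_limit_group` is hypothetical and only the category bounds are taken from Table 2; the study does not state how the mapping was implemented.

```python
from bisect import bisect_left

# Upper bounds (in TL) of Groups 1-7, copied from Table 2.
UPPER_BOUNDS = [8200, 16400, 24600, 32800, 41000, 49200, 82000]


def credit_limit_group(limit_tl: int) -> str:
    """Return the Table 2 category label for a credit limit given in TL."""
    if limit_tl < 0 or limit_tl > UPPER_BOUNDS[-1]:
        raise ValueError(f"credit limit {limit_tl} is outside the range of Table 2")
    # bisect_left returns the index of the first upper bound >= limit_tl,
    # so a value equal to a bound stays in the lower group.
    return f"Group {bisect_left(UPPER_BOUNDS, limit_tl) + 1}"
```

For example, a limit of 8200 TL falls in Group 1 and a limit of 8201 TL falls in Group 2, matching the boundaries listed in the table.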

The preprocessed data set was converted to attribute-relation file format (ARFF) to run the data mining
algorithms. Each numeric attribute in the ARFF data file takes the value 0 or 1, where 1 represents true and 0
represents false. Figure 1 shows the structure of the ARFF data file. The info gain attribute eval algorithm [18] is
used to measure the information gain of each attribute with respect to the credit limit attribute (output class).
According to the attribute rankings, the unclear investigation result level attribute does not reduce the
overall entropy (its information gain is 0), so it was removed from the data set. Table 3 lists the information
gain attribute rankings.


@relation data

@attribute MIIR_50 numeric
@attribute MIIR_5060 numeric
@attribute MIIR_6070 numeric
@attribute MIIR_7080 numeric
@attribute PublicSector_Paid numeric
@attribute PublicSector_Unpaid numeric
@attribute PrivateSector_Paid numeric
@attribute PrivateSector_Unpaid numeric
@attribute LowRiskLevel numeric
@attribute MediumRiskLevel numeric
@attribute HighRiskLevel numeric
@attribute InvestigationAccept numeric
@attribute InvestigationUnclear numeric
@attribute InvestigationReject numeric
@attribute CreditLimit {CAT1,CAT2,CAT3,CAT4,CAT5,CAT6,CAT7}


Figure 1. Data file structure 
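
The 0/1 encoding behind this file structure can be sketched as follows. This is an illustration only: it assumes that exactly one indicator is set per attribute group, the record keys (`miir`, `income`, `risk`, `investigation`) are hypothetical, and the indicator names follow Figure 1 and Table 3.

```python
# Indicator names in the order of Figure 1, grouped by the four attribute groups of Table 1.
GROUPS = {
    "miir": ["MIIR_50", "MIIR_5060", "MIIR_6070", "MIIR_7080"],
    "income": ["PublicSector_Paid", "PublicSector_Unpaid",
               "PrivateSector_Paid", "PrivateSector_Unpaid"],
    "risk": ["LowRiskLevel", "MediumRiskLevel", "HighRiskLevel"],
    "investigation": ["InvestigationAccept", "InvestigationUnclear", "InvestigationReject"],
}


def encode(record: dict) -> list:
    """Turn a record with one chosen value per group into 14 binary indicators."""
    row = []
    for group, names in GROUPS.items():
        if record[group] not in names:
            raise ValueError(f"unknown value {record[group]!r} for group {group!r}")
        # 1 marks the indicator that applies to this customer, 0 all others.
        row.extend(1 if name == record[group] else 0 for name in names)
    return row


def arff_data_line(record: dict, category: str) -> str:
    """Format one ARFF data row: 14 indicators followed by the class label."""
    return ",".join(str(v) for v in encode(record)) + "," + category
```

A record with a 50-60% ratio, private sector paid income, low risk and an accepted investigation would be written as `0,1,0,0,0,0,1,0,1,0,0,1,0,0` followed by its credit limit category.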


Table 3. Information gain attribute rankings 


Attribute Name         Attribute Rank
MIIR_7080              0.608
MIIR_50                0.475
PrivateSector_Unpaid   0.449
MIIR_6070              0.438
InvestigationAccept    0.4
InvestigationReject    0.367
LowRiskLevel           0.283
HighRiskLevel          0.226
PrivateSector_Paid     0.213
PublicSector_Paid      0.197
MIIR_5060              0.183
MediumRiskLevel        0.154
PublicSector_Unpaid    0.13
InvestigationUnclear   0
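
The information gain behind such a ranking is the entropy of the class attribute minus the weighted entropy that remains after splitting on the attribute. The sketch below is a generic implementation of that definition, not the code used in the study; the function names are illustrative.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder
```

With toy data (not taken from the study), a binary indicator that separates two equally frequent classes perfectly has a gain of 1 bit, while an indicator that has the same value in every record has a gain of 0, which is one way an attribute can end up at the bottom of the ranking.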


2.2. Data mining algorithms 

RF, SMO, PART, DT, J48, MP, JRip, NB, OneR and ZeroR algorithms were used for building
prediction models for the classification of customer credit limits. Brief descriptions of these algorithms are listed below:

— Random forest (RF): RF is an ensemble learning method that combines several tree-based
predictors. The output of the algorithm is the mode of the classes predicted by the trees in the forest [19].

— Support vector machines (SVM) and sequential minimal optimization (SMO): SVM is a supervised
learning approach for classification tasks in the machine learning domain. The algorithm tries to find a
hyperplane that separates the data points of a given data set distinctly. SMO is an optimization algorithm that
solves the quadratic programming problem arising in the SVM model training phase [20], [21].

— PART: Generates classification rules by creating partial decision trees without global optimization. The 
algorithm is an efficient separate-and-conquer rule learning technique [22]. 

— Decision table (DT): DT is a simple but effective supervised learning algorithm used in classification
problems. It has a collection of rules called a decision list, which is used in the classification process. Each
rule is processed sequentially until a matching rule is found [23].

— J48: J48 is a version of the C4.5 decision tree algorithm implemented in the Java
programming language. The algorithm is capable of generating a pruned or unpruned decision tree based on
information entropy [24].

— Multilayer perceptron (MP): MP is a feed-forward artificial neural network which uses backpropagation
for training classification models [25].

— JRip: It is the implementation of repeated incremental pruning to produce error reduction (RIPPER) 
approach which is a rule-based classifier [26]. 

— Naive Bayes (NB): NB is a probabilistic classification approach based on Bayes' theorem. It assumes that a
feature of a class is independent of any other feature [27].

— One rule (OneR): It is a simple classification algorithm which tries to find the one rule with the minimum 
prediction error according to a training data set [28]. 

— Zero rule (ZeroR): This algorithm is another simple classification approach which ignores predictors 
other than the target (class) attribute. The mean is calculated for numeric classes and mode is computed 
for nominal classes [29]. This study uses ZeroR for determining the baseline performance for machine 
learning algorithms. 
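
The two simplest algorithms in the list can be sketched in a few lines. This is a simplified illustration for nominal attributes and a nominal class, with hypothetical function names; it omits details such as the handling of numeric attributes and missing values that full implementations include.

```python
from collections import Counter, defaultdict


def zero_r(labels):
    """ZeroR for a nominal class: always predict the most frequent label."""
    return Counter(labels).most_common(1)[0][0]


def one_r(rows, labels):
    """OneR: choose the single attribute whose value -> majority-class rule
    makes the fewest errors on the training data.

    Returns (attribute_index, {attribute_value: predicted_label}).
    """
    best = None
    for j in range(len(rows[0])):
        by_value = defaultdict(Counter)
        for row, y in zip(rows, labels):
            by_value[row[j]][y] += 1
        # For each attribute value, predict the class seen most often with it.
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
        if best is None or errors < best[0]:
            best = (errors, j, rule)
    return best[1], best[2]
```

Because ZeroR ignores every input attribute, any model that learns something from the attributes should score above it, which is why it serves as the baseline here.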


2.3. Comparing algorithms 

Accuracy, precision and recall metrics are computed for comparing data mining models used in this 
study. These metrics are calculated using true positive (TP), true negative (TN), false positive (FP) and false 
negative (FN) values extracted from confusion matrices of prediction models. A confusion matrix holds the 
number of actual values in its rows and keeps the number of predicted values in its columns. 

TP is the number of instances that the classification model correctly predicts as the positive
class. In a similar way, TN is the number of instances that the classification model correctly predicts as the
negative class. FP represents the number of instances incorrectly predicted as the positive class, whereas
FN is the number of instances incorrectly predicted as the negative class.
Accuracy, precision and recall scores are calculated according to the described TP, TN, FP and FN values.
Accuracy is the percentage of correctly predicted instances. Precision is the ratio of TP values to the
sum of TP and FP values. Recall is the ratio of TP values to the sum of TP and FN values.
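
For a problem with more than two classes, these quantities are computed per class by treating that class as positive and all others as negative. The sketch below follows the row/column convention described above (rows are actual classes, columns are predicted classes). Combining the per-class values with a class-frequency weighted average is an assumption made for this illustration; the study does not state which averaging was used.

```python
def classification_metrics(cm):
    """cm[i][j] = number of instances of actual class i predicted as class j.

    Returns (accuracy, weighted precision, weighted recall).
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(k)) / total
    precision, recall = [], []
    for i in range(k):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                       # actual class i, predicted as another class
        fp = sum(cm[r][i] for r in range(k)) - tp  # predicted class i, actually another class
        precision.append(tp / (tp + fp) if tp + fp else 0.0)
        recall.append(tp / (tp + fn) if tp + fn else 0.0)
    # Assumption: per-class values are averaged with class-frequency weights.
    weights = [sum(row) / total for row in cm]
    weighted_precision = sum(w * p for w, p in zip(weights, precision))
    weighted_recall = sum(w * r for w, r in zip(weights, recall))
    return accuracy, weighted_precision, weighted_recall
```

Under this weighting, the weighted recall is algebraically equal to the accuracy, since each class contributes its TP count divided by the total number of instances.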


2.4. Building classification models 

Waikato Environment for Knowledge Analysis (WEKA) is an open-source machine learning platform
[30]. It is used for training and testing the classification models with the described bank-customer data set.
Each model is tested with both 10-fold cross-validation and a 66% percentage split, in which 66% of the
records are used for training and the remainder for testing. Comparison
metrics are collected for both of these testing approaches and the results are discussed in the next section.
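
The two testing approaches differ only in how the records are partitioned. The following sketch shows the index bookkeeping with plain shuffling; the function names, the seed and the rounding of the split point are assumptions of this illustration, and tools may additionally stratify the folds by class.

```python
import random


def percentage_split(n, train_fraction=0.66, seed=1):
    """Shuffle n record indices and split them into (train, test) lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = round(n * train_fraction)  # assumed rounding of the split point
    return idx[:cut], idx[cut:]


def k_fold_indices(n, k=10, seed=1):
    """Yield (train, test) index lists; every record is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]
```

With 401 records, `percentage_split(401)` leaves 136 test instances under the rounding assumed here, which is consistent with reported split accuracies such as 93.38% (127 of 136 correct).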


3. RESULTS AND DISCUSSION 

The algorithms described in Section 2.2 are applied to the final version of the banking data set.
Two experiments are conducted, one for each testing approach. Accuracy, precision and
recall metrics are calculated for each experiment. The obtained results are listed in Table 4.


Table 4. Performance results of classification models 


            Experiment 1: 10-Fold Cross-Validation    Experiment 2: 66% Split Test
Algorithm   Accuracy (%)   Precision   Recall         Accuracy (%)   Precision   Recall
RF          96.76          0.97        0.97           93.38          0.94        0.93
SMO         96.51          0.97        0.97           93.38          0.95        0.93
PART        96.51          0.97        0.97           93.38          0.95        0.93
DT          95.51          0.96        0.96           90.44          0.92        0.90
J48         95.51          0.96        0.96           91.91          0.94        0.92
MP          95.26          0.95        0.95           91.91          0.94        0.92
JRip        94.51          0.95        0.95           91.91          0.94        0.92
NB          86.03          0.90        0.86           84.56          0.88        0.85
OneR        44.14          0.55        0.44           38.24          0.49        0.38
ZeroR       26.69          0.26        0.26           22.06          0.22        0.22


According to the performance results of the classification models, RF has the highest classification
accuracy in Experiment 1 (96.76%) and shares the highest accuracy in Experiment 2 (93.38%) with SMO and
PART. These three algorithms also share the highest recall in both experiments and the highest precision
(0.97) in Experiment 1, whereas SMO and PART have the highest precision (0.95) in Experiment 2.
All classification models produced better scores than the ZeroR baseline.

Table 5 shows the comparison results of the prediction models. Each row in the table starts with a
prediction model. The columns after the prediction model list the names of the other algorithms that produced
lower results than the model in that row according to the accuracy, precision and recall scores. These results
are listed for both experiments. The last column shows how many times the prediction
model in the row produces better results than other algorithms. For example, SMO and PART each have better
results in 43 comparisons with other algorithms across both experiments, whereas RF has
better results in 41 comparisons. J48, DT, MP, JRip, NB, OneR and ZeroR have better results in at most
27 comparisons. Based on the results presented in Tables 4 and 5, RF has the highest or joint-highest
score in most of the metrics in both experiments, whereas SMO and PART have the
greatest number of better comparison results.


Table 5. Comparison results of classification models


Experiment 1: 10-Fold Cross-Validation. Experiment 2: 66% Split.

Algorithm               Experiment   Metric      Has better result than                           Number of better results
SMO                     1            Accuracy    DT, J48, MP, JRip, NB, OneR, ZeroR               43
                        1            Precision   DT, J48, MP, JRip, NB, OneR, ZeroR
                        1            Recall      DT, J48, MP, JRip, NB, OneR, ZeroR
                        2            Accuracy    J48, MP, JRip, DT, NB, OneR, ZeroR
                        2            Precision   RF, J48, MP, JRip, DT, NB, OneR, ZeroR
                        2            Recall      J48, MP, JRip, DT, NB, OneR, ZeroR
PART                    1            Accuracy    DT, J48, MP, JRip, NB, OneR, ZeroR               43
                        1            Precision   DT, J48, MP, JRip, NB, OneR, ZeroR
                        1            Recall      DT, J48, MP, JRip, NB, OneR, ZeroR
                        2            Accuracy    J48, MP, JRip, DT, NB, OneR, ZeroR
                        2            Precision   RF, J48, MP, JRip, DT, NB, OneR, ZeroR
                        2            Recall      J48, MP, JRip, DT, NB, OneR, ZeroR
Random Forest           1            Accuracy    SMO, PART, DT, J48, MP, JRip, NB, OneR, ZeroR    41
                        1            Precision   DT, J48, MP, JRip, NB, OneR, ZeroR
                        1            Recall      DT, J48, MP, JRip, NB, OneR, ZeroR
                        2            Accuracy    J48, MP, JRip, DT, NB, OneR, ZeroR
                        2            Precision   DT, NB, OneR, ZeroR
                        2            Recall      J48, MP, JRip, DT, NB, OneR, ZeroR
J48                     1            Accuracy    MP, JRip, NB, OneR, ZeroR                        27
                        1            Precision   MP, JRip, NB, OneR, ZeroR
                        1            Recall      MP, JRip, NB, OneR, ZeroR
                        2            Accuracy    DT, NB, OneR, ZeroR
                        2            Precision   DT, NB, OneR, ZeroR
                        2            Recall      DT, NB, OneR, ZeroR
Decision Table          1            Accuracy    MP, JRip, NB, OneR, ZeroR                        24
                        1            Precision   MP, JRip, NB, OneR, ZeroR
                        1            Recall      MP, JRip, NB, OneR, ZeroR
                        2            Accuracy    NB, OneR, ZeroR
                        2            Precision   NB, OneR, ZeroR
                        2            Recall      NB, OneR, ZeroR
Multilayer Perceptron   1            Accuracy    JRip, NB, OneR, ZeroR                            22
                        1            Precision   NB, OneR, ZeroR
                        1            Recall      NB, OneR, ZeroR
                        2            Accuracy    DT, NB, OneR, ZeroR
                        2            Precision   DT, NB, OneR, ZeroR
                        2            Recall      DT, NB, OneR, ZeroR
JRip                    1            Accuracy    NB, OneR, ZeroR                                  21
                        1            Precision   NB, OneR, ZeroR
                        1            Recall      NB, OneR, ZeroR
                        2            Accuracy    DT, NB, OneR, ZeroR
                        2            Precision   DT, NB, OneR, ZeroR
                        2            Recall      DT, NB, OneR, ZeroR
Naive Bayes             1            Accuracy    OneR, ZeroR                                      12
                        1            Precision   OneR, ZeroR
                        1            Recall      OneR, ZeroR
                        2            Accuracy    OneR, ZeroR
                        2            Precision   OneR, ZeroR
                        2            Recall      OneR, ZeroR
OneR                    1            Accuracy    ZeroR                                            6
                        1            Precision   ZeroR
                        1            Recall      ZeroR
                        2            Accuracy    ZeroR
                        2            Precision   ZeroR
                        2            Recall      ZeroR
ZeroR                   1            Accuracy    None                                             0
                        1            Precision   None
                        1            Recall      None
                        2            Accuracy    None
                        2            Precision   None
                        2            Recall      None
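
The counts in the last column of Table 5 can be recomputed from Table 4 by comparing every pair of algorithms on each of the six metric columns and counting strict wins (ties do not count). The sketch below is an illustration of that bookkeeping with hypothetical names; the scores are copied from Table 4.

```python
# (accuracy %, precision, recall) for Experiment 1, then for Experiment 2, from Table 4.
TABLE_4 = {
    "RF":    (96.76, 0.97, 0.97, 93.38, 0.94, 0.93),
    "SMO":   (96.51, 0.97, 0.97, 93.38, 0.95, 0.93),
    "PART":  (96.51, 0.97, 0.97, 93.38, 0.95, 0.93),
    "DT":    (95.51, 0.96, 0.96, 90.44, 0.92, 0.90),
    "J48":   (95.51, 0.96, 0.96, 91.91, 0.94, 0.92),
    "MP":    (95.26, 0.95, 0.95, 91.91, 0.94, 0.92),
    "JRip":  (94.51, 0.95, 0.95, 91.91, 0.94, 0.92),
    "NB":    (86.03, 0.90, 0.86, 84.56, 0.88, 0.85),
    "OneR":  (44.14, 0.55, 0.44, 38.24, 0.49, 0.38),
    "ZeroR": (26.69, 0.26, 0.26, 22.06, 0.22, 0.22),
}


def better_counts(results):
    """For each algorithm, count the (other algorithm, metric) pairs it strictly beats."""
    counts = {}
    for name, scores in results.items():
        counts[name] = sum(
            1
            for other, other_scores in results.items()
            if other != name
            for mine, theirs in zip(scores, other_scores)
            if mine > theirs
        )
    return counts
```

Calling `better_counts(TABLE_4)` reproduces the values 43, 43 and 41 for SMO, PART and RF. The count rewards beating many weaker models on many metrics, so it can rank algorithms differently from a comparison of the top scores alone.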


4. CONCLUSION 

This study presents a detailed comparison of data mining algorithms for the customer credit allocation
process using two different experiment sets. RF, SMO, PART, DT, J48, MP, JRip, NB, OneR and ZeroR
algorithms are trained and tested on a banking data set which has characteristics of possible customer credit
requests and credit limits. The obtained test results suggest that SMO, PART and RF are the three most
accurate data mining approaches for the customer credit allocation process. Proposing a hybrid data mining
solution using SMO, PART and RF algorithms for the implementation of a customer credit allocation tool is
planned as a future study.